This vignette builds upon concepts developed in the first two case studies to show how SDT can be used to train and apply a species distribution model.
In [1]:
# Load the required packages
using SpeciesDistributionToolkit
const SDT = SpeciesDistributionToolkit # A no-overhead shortcut for the package name
using Statistics
using CairoMakie
CairoMakie.activate!(px_per_unit=6.0)
import Random
Random.seed!(12345678)
Random.TaskLocalRNG()
The next steps cover the same ground as the first two case studies, so we will not go into the details of the code:
In [2]:
# Get the polygon of interest from OpenStreetMap data
place = getpolygon(PolygonData(OpenStreetMap, Places), place="Paraguay")
extent = SDT.boundingbox(place)
provider = RasterData(CHELSA2, BioClim)
L = SDMLayer{Float16}[SDMLayer(provider; layer=l, extent...) for l in layers(provider)]
mask!(L, place);
We now get the species occurrence data from GBIF - these are the same records as in the first case study. Note that, to follow GBIF best practices, we download this dataset using its DOI. This is recommended both to avoid sending multiple requests to the API and to ensure that the dataset can be properly cited. When using the downloaded version, the records are returned in the format of the OccurrencesInterface package, which is transparently usable by SpeciesDistributionToolkit.
In [3]:
# Get occurrence data from GBIF
tx = taxon("Akodon montensis")
presences = GBIF.download("10.15468/dl.d3cxpr")
┌ Info: Malformed date for record 665803345 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665803344 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665798817 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665798816 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665792904 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665787173 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665787172 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 665787171 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521452319 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521451615 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521451537 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521451105 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521451085 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521450190 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521449733 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521449188 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521448771 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521448650 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521447074 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521446720 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521446167 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521445978 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521445843 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521445769 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521444206 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521444042 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521443694 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521443151 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521443041 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521442777 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521442053 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521442016 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521441643 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521441640 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521441618 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
┌ Info: Malformed date for record 2521440184 - date will be missing
└ @ GBIF /home/tpoisot/.julia/packages/GBIF/cqJfE/src/download.jl:114
To generate pseudo-absence data, we start by projecting the occurrences to a layer. This results in a spatially thinned layer in which cells without an occurrence have a value of false, and those with at least one occurrence have a value of true.
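The cell performing this step is omitted here; a minimal sketch, assuming the `mask`, `pseudoabsencemask`, and `backgroundpoints` functions documented in SpeciesDistributionToolkit (the `DistanceToEvent` strategy and the 2:1 background ratio are illustrative choices, not necessarily those of the original cell), could look like:

```julia
# Hypothetical reconstruction of the omitted cell - check names against the
# SpeciesDistributionToolkit documentation before use
presencelayer = mask(L[1], presences, Bool)                   # true where at least one occurrence
background = pseudoabsencemask(DistanceToEvent, presencelayer) # candidate background cells
bgpoints = backgroundpoints(background, 2sum(presencelayer))   # twice as many background points
```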
We now define the structure of the model: the data will be transformed through a PCA, then classified using a decision tree. Note that the model call accepts, in that order, the layers with covariates, the layer with presences, and the layer with background points:
In [9]:
# Specification of the SDM
sdm = SDM(PCATransform, DecisionTree, L, presencelayer, bgpoints)
PCATransform → DecisionTree → P(x) ≥ 0.5
As an important sidenote, SDeMo will never perform a PCA on the entire layer, as this would immediately create data leakage. The PCA is trained only on the presence data (as we have generated pseudo-absences through a heuristic), unless the absence data are specifically requested.
As we do not have a large volume of data, we will limit the number of nodes in the tree to twelve, with a maximum depth of five. This avoids trees that are too dense or too deep, which protects against overfitting. Note that SDeMo always prunes trees after training.
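The cell setting these limits is not shown; a hypothetical sketch (the `classifier`, `maxnodes!`, and `maxdepth!` names are assumptions to be checked against the SDeMo documentation) could be:

```julia
# Assumed SDeMo setters for the tree hyperparameters discussed above
maxnodes!(classifier(sdm), 12)  # at most twelve nodes in the tree
maxdepth!(classifier(sdm), 5)   # at most five levels deep
```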
We can now perform k-fold cross-validation using all layers we downloaded. The crossvalidate function returns two arrays of confusion matrices, which can then be passed to one of the (many) evaluation measures SDeMo supports. We look at the mean and 95% CI of the Matthews correlation coefficient, to ensure that the model is trainable, and that the tree does not overfit.
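Assuming SDeMo's `kfold` function to generate the folds (the value of k is illustrative), this step can be sketched as:

```julia
# Assign each observation to a fold, then cross-validate
folds = kfold(sdm; k=10)
cv = crossvalidate(sdm, folds)
# Mean and 95% CI of the MCC on the validation folds
mcc(cv.validation), ci(cv.validation)
```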
Because we have likely included many irrelevant predictors, we will perform variable selection. We start by identifying a series of variables with low collinearity using the stepwise variance inflation factor, then perform forward selection within this pool:
In [12]:
variables!(sdm, AllVariables, folds) # We reset the model to use all features
variables!(sdm, StrictVarianceInflationFactor{100.0}; included=[1, 12])
@info "Variables after VIF: $(join(string.(variables(sdm)), ", ", ", and "))"
variables!(sdm, ForwardSelection, folds)
@info "Variables after selection: $(join(string.(variables(sdm)), ", ", ", and "))"
cv = crossvalidate(sdm, folds);
println("""
Validation: $(round(mcc(cv.validation); digits=2)) ± $(round(ci(cv.validation); digits=2))
Training: $(round(mcc(cv.training); digits=2)) ± $(round(ci(cv.training); digits=2))
""")
[ Info: Variables after VIF: 1, 2, 4, 5, 8, 9, 10, 12, 13, 14, 15, 18, and 19
[ Info: Variables after selection: 15, 4, and 1
Validation: 0.92 ± 0.06
Training: 0.95 ± 0.01
We can check which variables have been retained. They are printed out after being shuffled to prevent the temptation of looking for meaning in their order.
In [13]:
Random.shuffle(variables(sdm))
3-element Vector{Int64}:
4
1
15
We can finally train the model using all the available training data:
In [14]:
train!(sdm)
PCATransform → DecisionTree → P(x) ≥ 0.338
Note that as part of the training, the threshold of the model (the score above which a prediction of “presence” is made) is automatically adjusted. The threshold can be turned on and off at prediction time (to get either the score or the range) using the threshold keyword.
The initial prediction of the model can be made by calling the predict method with the model and a vector of layers, in which case the result is returned as a layer. Other methods for predict exist, and they all accept many keywords.
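Based on the description above, this call can be sketched as follows (the use of the `threshold` keyword follows the previous paragraph; check the SDeMo documentation for the full list of keywords):

```julia
# Prediction over the full extent, returned as a layer
prediction = predict(sdm, L; threshold=true)  # boolean range map
scores = predict(sdm, L; threshold=false)     # raw scores instead of presence/absence
```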
As we used a small tree, it may be beneficial to turn it into an ensemble model. We do so by creating a Bagging object with 20 copies of the model, and we additionally bag the features used for each tree:
In [16]:
ensemble = Bagging(sdm, 20)
bagfeatures!(ensemble)
{PCATransform → DecisionTree → P(x) ≥ 0.338} × 20
When training the ensemble, SDeMo will retrain every tree using the assigned variables and observations:
In [17]:
train!(ensemble)
{PCATransform → DecisionTree → P(x) ≥ 0.338} × 20
As this is a bagging ensemble, we can calculate the out-of-bag error (one minus the accuracy):
In [18]:
outofbag(ensemble) |> (x) -> 1 - accuracy(x)
0.062211981566820285
When predicting from an ensemble, the additional keyword consensus can be used to specify how the predictions from each tree are combined. Note that any model SDeMo knows about can be used as part of an ensemble, and the discussion here is not specific to trees. The consensus function needs to accept an array and return a value, and the default is the median. SDeMo provides the built-in iqr (inter-quartile range) and majority (class recommended by the most trees). We can plot the median prediction of the model:
The code to plot the areas where a majority of trees predict that the species may be present is very similar: note that we use threshold=true to work on boolean predictions, and consensus=majority to get the most frequently recommended class.
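The two prediction calls described above (without the plotting code) can be sketched as follows; the keyword names follow the text, and `median` comes from the Statistics standard library loaded at the start of the vignette:

```julia
# Median score across the 20 trees (the default consensus)
prd = predict(ensemble, L; consensus=median, threshold=false)
# Cells where a majority of trees predict a presence
maj = predict(ensemble, L; consensus=majority, threshold=true)
```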
We will now demonstrate how this prediction can be clipped by the elevational range of the species. We get the elevation information from another data provider (WorldClim v2):
In [22]:
# Get the elevation data
elevprov = RasterData(WorldClim2, Elevation)
elev = convert(SDMLayer{Float16}, SDMLayer(elevprov; resolution=0.5, extent...))
🗺️ A 999 × 1007 layer with 1005993 Float16 cells
Projection: +proj=longlat +datum=WGS84 +no_defs
As the elevation data come from another provider, there is a chance that they will not be compatible with the environmental layers we have. This can be due to a different CRS, or to a different spatial resolution or cell coordinates. For this reason, we first create a new layer that is similar to the covariates:
In [23]:
elevation = similar(L[1], Float16)
🗺️ A 999 × 1007 layer with 508483 Float16 cells
Projection: +proj=longlat +datum=WGS84 +no_defs
We now interpolate the original elevation layer to the new one. Interpolation currently relies on bilinear filtering, but we are in the process of expanding the method to support arbitrary functions. Note that the interpolation takes care of bringing the data to the correct CRS.
In [24]:
interpolate!(elevation, elev)
🗺️ A 999 × 1007 layer with 1003995 Float16 cells
Projection: +proj=longlat +datum=WGS84 +no_defs
We can now mask the elevation data using the polygon for Paraguay:
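This uses the same `mask!` call applied to the environmental layers at the start of the vignette:

```julia
# Keep only the cells inside the Paraguay polygon
mask!(elevation, place)
```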
As in the first case study, we can read the elevation for known occurrences directly, and feed this to the extrema function to get the limit of altitudes at which the species has been observed:
In [26]:
# Get the elevation range
elevation_range = extrema(elevation[presences])
(Float16(60.0), Float16(458.5))
We can create a mask specifying where the altitude is within the range of the species:
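Assuming that SDMLayer supports broadcasting of comparison operators (as other layer operations in this vignette suggest), this mask can be sketched as:

```julia
# true where the interpolated elevation falls within the observed range
lo, hi = elevation_range
inrange = (elevation .>= lo) .& (elevation .<= hi)
```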
Figure 1: Predicted range of Akodon montensis in Paraguay based on a rotation forest trained on GBIF occurrences and the BioClim variables. The predicted range is clipped to the elevational range of the species. The code to produce this figure is available as Supp. Mat. 3.
Because the model is trained, we can reuse it to, e.g., calculate the variable importance (which SDeMo does through non-parametric bootstrap, in a model-agnostic way):
In [31]:
vi = variableimportance(ensemble, montecarlo(sdm, n=10); threshold=false, samples=100)
vi ./= sum(vi)
vnames = ["BIO$(x)" for x in variables(ensemble)]
collect(zip(vnames, vi))
"Precipitation Seasonality (Coefficient of Variation)"
In the last part of this vignette, we will compare two techniques to explain the predictions made by the ensemble model.
Partial responses are commonly used in SDMs, and represent the prediction from a model when all variables but one are averaged (regular partial responses) or sampled at random from their distribution (inflated partial responses). These provide information about the ways in which the model responds to change in a variable of interest, compared to the "background" of all other variables.
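Assuming SDeMo's `partialresponse` function (the `inflated` keyword is named after the description above and should be checked against the documentation), the two flavours can be sketched as:

```julia
v = first(variables(sdm))  # an illustrative variable of interest
pr = partialresponse(sdm, v; threshold=false)                 # regular partial response
ipr = partialresponse(sdm, v; inflated=true, threshold=false) # inflated partial response
```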
SDeMo has code to perform the calculation of Shapley values. The explain method will perform the Monte-Carlo analysis on a pre-set number of samples. In contrast to the partial responses, Shapley values are calculated at the scale of a single prediction, and represent the departure from the average model prediction due to the specific value of the variable of interest for this prediction. References in the main text provide illustrations and explanations of how Shapley values can be used. To make sure that the output is comparable to other partial responses, we add the average prediction of the model to the Shapley values:
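A sketch of this calculation, assuming SDeMo's `explain` method and that `predict` called on the model alone returns the predictions for the training data:

```julia
v = first(variables(sdm))                    # an illustrative variable of interest
shap = explain(sdm, v; threshold=false)      # Shapley values for this variable
shap .+= mean(predict(sdm; threshold=false)) # shift by the average model prediction
```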
Finally, we will plot the Shapley responses, overlaid on top of the partial and inflated partial responses for this variable. This figure is presented in the main text:
Figure 2: Partial responses (red) and inflated partial responses (grey) to the most important variable. In addition, the Shapley values for all training data are presented in the same figure; green points are presences, and pale points are pseudo-absences. Shapley values were added to the average model prediction to be comparable to partial responses. The code to produce this figure is available as Supp. Mat. 3.